I enjoyed the soft landing on this course. It was comforting to start
and go through the very basics.
From this course, I expect to learn to integrate the use of GitHub and R
Markdown into my daily work. I also expect to learn best practices in
open data science. I heard about this course through
the email list of my doctoral program.
So far the book and exercises have been a great crash course in R tools and RStudio. I enjoy the style of writing in the book. My favorite section so far is Chapter 4, which focuses on plots; I found it useful.
My GitHub repository: https://github.com/kattilakoski/IODS-project
The dataset used in this assignment comes from a survey of the learning approaches and achievement of students on an introductory statistics course. In this assignment we use a subset of the results, including the gender, age, attitude, total points and learning approaches (deep, surface and strategic) of 166 students.
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
library(tidyverse)
#Read the data in:
learning2014 <- read_csv("./data/learning2014.csv")
#Explore dimensions:
dim(learning2014)
## [1] 166 7
#and structure
str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(learning2014[-1])
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
In the first scatter plot matrix each variable is plotted against each other. The variable names are shown on the diagonal of the plot. In the plots on the same row as a variable's text box, that variable is on the y-axis; in the plots in the same column as the text box, it is on the x-axis.
Based on this scatter plot matrix there are no strong, clearly visible correlations between the variables. However, some pairs show an upward or downward trend, such as attitude and points, as well as the deep versus surface approaches.
In the second plot matrix, created with ggpairs(), the variable names are on the top and left side of the plot. The upper right triangle shows the correlations between the continuous variables (age, attitude, deep, strategic, surface, points), the lower left triangle shows their scatter plots, the diagonal shows their density plots, and the leftmost column shows the histograms and box plots for the combinations of the categorical gender variable with the continuous variables.
From the matrix we can see that the majority of the respondents were female and around 20 years old. Age did not correlate significantly with any other variable. Attitude had a significant positive correlation with points and a significant negative correlation with the surface approach. The deep approach had a significant negative correlation with the surface approach, and the strategic approach also correlated negatively with the surface approach. This matrix confirmed the trends that were visible in the previous scatter plot matrix.
# create a regression model with three explanatory variables
my_model2 <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
# create a regression model without the variables that did not have statistically significant relationship with points
my_model3 <- lm(points ~ attitude, data = learning2014)
# print out a summary of the model
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Based on the t-tests of the linear regression, the only significant relationship is a positive one between attitude and points (p < 0.001). In regression, the t-test checks whether a coefficient differs significantly from zero, i.e., whether the explanatory variable has a relationship with the target variable. As both the strategic and surface variables had p-values greater than 0.1, they did not have a statistically significant relationship with the points variable.
The multiple R-squared for the first regression model with all three explanatory variables is 0.2074. This means that only about a fifth of the variability in points is explained by these explanatory variables. Even though the relationship between attitude and points is significant, the R-squared of the model including only attitude is smaller still at 0.1906, meaning that attitude alone explains even less of the variability in points.
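As a sanity check on what multiple R-squared means, it can be reproduced by hand from the residual and total sums of squares. A minimal sketch using the built-in mtcars data (so it runs anywhere, unlike the course CSV):

```r
# fit a simple linear model on the built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)

# multiple R-squared is 1 - RSS/TSS
rss <- sum(residuals(fit)^2)                  # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2) # total sum of squares
r2_manual <- 1 - rss / tss

# matches the value reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```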
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5 for Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
par(mfrow = c(2,2))
plot(my_model2, c(1,2,5))
#same for the model with only attitude as the explanatory variable
par(mfrow = c(2,2))
plot(my_model3, c(1,2,5))
There is no distinctive pattern in the Residuals vs Fitted plots in either case, suggesting there is no non-linear relationship between the explanatory and target variables.
The normal Q-Q plots show fairly straight lines following the dotted reference line, with deviations at both tails. These deviations may indicate that the residuals do not follow a normal distribution.
In the Residuals vs Leverage plots, dashed lines would appear in the upper and lower right corners if some observations had high Cook's distance scores, meaning they were influential to the regression results and did not follow the trend of the other cases. In these plots none of the cases seems very influential.
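The Cook's distance values behind the Residuals vs Leverage plot can also be inspected numerically with cooks.distance(). A sketch on a toy model built from mtcars, using the common (but not universal) 4/n rule of thumb to flag influential observations:

```r
# toy model on built-in data; the 4/n cutoff is an assumption, one common rule of thumb
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# flag observations whose Cook's distance exceeds 4/n
n <- nrow(mtcars)
names(which(cd > 4 / n))
```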
#install packages:
# install.packages("boot")
# install.packages("readr")
# access libraries
library(dplyr); library(ggplot2); library(readr); library(tidyr); library(boot)
#Read the joined student alcohol consumption data into R
alc <- read_csv("./data/student_alc.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl (1): high_use
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print out the names of the variables in the data
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
This data set includes data on student achievements in two Portuguese schools. It includes 370 observations and 35 variables. Variables include for example student grades, sex and age. The dataset was formed by combining two datasets, one regarding the student performance in mathematics and the other in Portuguese language.
### Choosing variables for this analysis

For this analysis I chose to focus on age, mother's education, study time and absences. My hypotheses are that younger students drink less than older ones, that the higher the mother's education level the lower the student's alcohol consumption, that students who study more drink less than students who study less, and that the more absences a student has, the more they drink.
#distribution of alcohol consumption and age
d1 <- ggplot(alc, aes(x = high_use, y = age)) + geom_boxplot() + ggtitle("Age and alcohol consumption")
d1
#distribution of alcohol consumption and mother's education
d2 <- ggplot(alc, aes(x = high_use, y = Medu)) + geom_boxplot() + ggtitle("Mother's education and student alcohol consumption")
d2
#distribution of alcohol consumption and study time
d3 <- ggplot(alc, aes(x = high_use, y = studytime)) + geom_boxplot() + ggtitle("Studytime and alcohol consumption")
d3
#distribution of alcohol consumption and absences
d4 <- ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot() + ggtitle("Absences and alcohol consumption")
d4
#Summary statistics
#age
# produce summary statistics by group
alc %>% group_by(high_use) %>% summarise(count = n(), mean(age))
## # A tibble: 2 × 3
## high_use count `mean(age)`
## <lgl> <int> <dbl>
## 1 FALSE 259 16.5
## 2 TRUE 111 16.8
#mother's education
alc %>% group_by(high_use) %>% summarise(count = n(), mean(Medu))
## # A tibble: 2 × 3
## high_use count `mean(Medu)`
## <lgl> <int> <dbl>
## 1 FALSE 259 2.80
## 2 TRUE 111 2.79
#study time
alc %>% group_by(high_use) %>% summarise(count = n(), mean(studytime))
## # A tibble: 2 × 3
## high_use count `mean(studytime)`
## <lgl> <int> <dbl>
## 1 FALSE 259 2.16
## 2 TRUE 111 1.77
#absences
alc %>% group_by(high_use) %>% summarise(count = n(), mean(absences))
## # A tibble: 2 × 3
## high_use count `mean(absences)`
## <lgl> <int> <dbl>
## 1 FALSE 259 3.71
## 2 TRUE 111 6.38
Based on the distribution plots it looks like my hypothesis is correct and younger students drink less than older ones, although in the numerical summary the difference in mean age is only 0.27 years, which seems small. Against my hypothesis, there appears to be no difference in alcohol use between students whose mothers have different education levels; this is also visible in the summary, as the difference in the mothers' mean education level is only 0.01. The distributions support my hypothesis on study time, as the students who drink less seem to study more, and the summary backs this up with a difference of 0.40 in mean study time between students with low and high alcohol use. My hypothesis about absences also seems right, as the students who drink more have more absences. This difference is the clearest: students with high alcohol use have on average 2.67 more absences than students with low alcohol use.
# find the model with glm()
m <- glm(high_use ~ age + Medu + studytime + absences, data = alc, family = "binomial")
# print out a summary of the fitted model
summary(m)
##
## Call:
## glm(formula = high_use ~ age + Medu + studytime + absences, family = "binomial",
## data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.798207 1.786727 -1.566 0.117323
## age 0.164398 0.103533 1.588 0.112316
## Medu -0.003725 0.111021 -0.034 0.973232
## studytime -0.579521 0.158539 -3.655 0.000257 ***
## absences 0.075027 0.023247 3.227 0.001249 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 417.24 on 365 degrees of freedom
## AIC: 427.24
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) age Medu studytime absences
## -2.798207282 0.164397602 -0.003725333 -0.579521121 0.075027298
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp()
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.06091918 0.001766207 1.979230
## age 1.17868287 0.963461160 1.447356
## Medu 0.99628160 0.801834321 1.240253
## studytime 0.56016655 0.406319506 0.757511
## absences 1.07791358 1.032143878 1.130799
Based on the coefficient estimates in the model summary, higher values of mother's education and study time are associated with a lower likelihood of the high_use variable being TRUE, while higher age and more absences are associated with a higher likelihood. However, only the coefficients for study time and absences are statistically significant. From the model summary one can therefore say that the more time a student spends studying, the lower the likelihood of high alcohol use, and the more absence days a student has, the higher the likelihood of high alcohol use. Based on these results my hypotheses about study time and absences were supported, while the hypotheses about mother's education and age were not.
Based on the odds ratio for age, a one-year increase in age multiplies the odds of high alcohol use by 1.18, other variables held constant. The same interpretation applies to the other variables: each additional level of mother's education multiplies the odds by 1.00 (i.e., no change), each additional unit of study time multiplies the odds by 0.56, and each additional absence multiplies them by 1.08. Based on the point estimates, the biggest changes in odds come from age (the higher the age, the higher the odds of high use) and study time (the more study time, the lower the odds).

The 95% confidence intervals of the odds ratios for the intercept, age, Medu, studytime and absences are 0.00-1.98, 0.96-1.45, 0.80-1.24, 0.41-0.76 and 1.03-1.13 respectively. The interpretation is that we are 95% confident that the true odds ratio of the given variable lies within that interval. For example, we are 95% confident that the odds ratio for age is between 0.96 and 1.45; this interval is wide and includes 1, so we cannot be confident even of the direction of the effect. The same goes for mother's education. The intervals for study time and absences are narrower and exclude 1, meaning we have better knowledge of their effects on the target variable (alcohol use). Based on these results my hypothesis about mother's education was wrong, the hypotheses on study time and absences were supported, and the hypothesis on age pointed in the right direction but was not statistically significant.
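The odds-ratio interpretation can be verified on a toy logistic model: exponentiating a coefficient gives the multiplicative change in the odds for a one-unit increase in the predictor. A sketch on the built-in mtcars data (transmission type by weight, used here as placeholders for the alcohol variables):

```r
# toy logistic regression: transmission type (am) by weight in mtcars
m_toy <- glm(am ~ wt, data = mtcars, family = "binomial")

# exponentiated coefficient = odds ratio per one-unit increase in wt
or_wt <- exp(coef(m_toy)["wt"])

# equivalently: the ratio of predicted odds at wt = 3 versus wt = 2
odds_at <- function(w) {
  p <- predict(m_toy, newdata = data.frame(wt = w), type = "response")
  p / (1 - p)
}
all.equal(unname(or_wt), unname(odds_at(3) / odds_at(2)))  # TRUE
```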
# find the model with glm() for only the statistically significant study time and absences
m <- glm(high_use ~ studytime + absences, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, studytime, absences, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 5
## studytime absences high_use probability prediction
## <dbl> <dbl> <lgl> <dbl> <lgl>
## 1 2 3 FALSE 0.266 FALSE
## 2 1 0 FALSE 0.336 FALSE
## 3 1 7 TRUE 0.468 FALSE
## 4 3 1 FALSE 0.149 FALSE
## 5 1 6 FALSE 0.448 FALSE
## 6 3 2 FALSE 0.159 FALSE
## 7 2 2 FALSE 0.251 FALSE
## 8 2 3 FALSE 0.266 FALSE
## 9 1 4 TRUE 0.410 FALSE
## 10 1 2 TRUE 0.372 FALSE
#2x2 tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 248 11
## TRUE 93 18
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2810811
Based on the cross-tabulation/confusion matrix the model predicts the low-use cases fairly well: only 11 of the 259 students who reported low alcohol use were misclassified as high use (false positives). The model does not predict the high-use cases well: of the 111 high-use cases it got only 18 right and misclassified 93 as low use (false negatives).
The model has a training prediction error of 0.28, which means that it makes a wrong prediction 28% of the time. This is a rather high error rate, so I would conclude that this model does not predict very well and is therefore not very powerful.
I think the "error rate of my guesses" and that of the model were pretty similar. I would say the model is slightly better for prediction than guessing, but it has much room for improvement, for example by including other variables.
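One way to get a less optimistic error estimate than the training error above is cross-validation, which the already-loaded boot package supports via cv.glm(). A sketch with the same loss function, on a toy model built from mtcars as a stand-in for alc and m:

```r
library(boot)

# toy logistic model on built-in data (stand-in for the alc model m)
m_toy <- glm(am ~ wt + hp, data = mtcars, family = "binomial")

# the same loss function as above: proportion of misclassified observations
loss_func <- function(class, prob) {
  mean(abs(class - prob) > 0.5)
}

# 10-fold cross-validation; delta[1] is the CV estimate of the error rate
set.seed(1)
cv <- cv.glm(data = mtcars, glmfit = m_toy, cost = loss_func, K = 10)
cv$delta[1]
```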
# Install required packages:
#install.packages(c("MASS", "corrplot"))
#install.packages("plotly")
# Access required packages
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(tidyverse)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(dplyr)
library(GGally)
# Load the Boston data from the MASS package
data("Boston")
# Explore the structure and the dimensions of the data
glimpse(Boston) # 506 rows and 14 columns
## Rows: 506
## Columns: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The dataset used in this assignment and the explanations of the variable names are found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html The dataset has 506 rows and 14 columns and it includes information on housing values in the suburbs of Boston. The columns include, for example, per capita crime rate by town (crim) and average number of rooms per dwelling (rm).
# Show a graphical overview of the data
pairs(Boston)
p <- ggpairs(Boston)
# draw the plot
p
# Show summaries of the variables in the data.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Based on the plot matrix it seems that all variables have statistically significant correlations with each other except chas (the Charles River dummy variable: 1 if the tract bounds the river, 0 otherwise), which does not correlate at all or correlates more weakly than the other variables; chas only takes the values 0 and 1. From the density plots we can see, for example, that crim, zn, chas, dis, rad, tax and lstat are concentrated at low values (although some have a second small peak), only rm looks approximately normally distributed, indus has two peaks of similar size, and age, ptratio and black are concentrated at higher values.
The same trends show in the corrplot, though the strength (intensity of color) and the sign of the correlation (blue for positive, red for negative) are perhaps more easily visible there. The strongest negative correlations (when one variable gets higher, the other gets lower) seem to be between indus and dis, nox and dis, age and dis, and lstat and medv. The strongest positive correlation (both get higher together) seems to be between rad and tax.
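The strongest pairs can also be picked out programmatically rather than by eye. A sketch that masks the diagonal of the correlation matrix and extracts the largest absolute correlation:

```r
library(MASS)

cm <- cor(Boston)
diag(cm) <- 0  # mask the trivial self-correlations of 1

# indices of the largest absolute off-diagonal correlation
idx <- which(abs(cm) == max(abs(cm)), arr.ind = TRUE)[1, ]

# for Boston this picks out rad and tax (about 0.91)
c(rownames(cm)[idx[1]], colnames(cm)[idx[2]], round(cm[idx[1], idx[2]], 2))
```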
# Standardize the dataset
boston_scaled <- scale(Boston)
# Print out summaries of the scaled data
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime', using the quantiles as the break points in the categorical variable
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# Drop the old crime rate variable from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# Divide the dataset to train and test sets, so that 80% of the data belongs to the train set
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
As a result of scaling, the variable values got smaller and the mean of each variable is now zero. This is because scaling subtracts the column mean from each column and divides the difference by the column's standard deviation.
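In other words, scale() is equivalent to subtracting the column mean and dividing by the column standard deviation, which is easy to verify on a small example:

```r
x <- c(2, 4, 6, 8)

# scale() centres each column with its mean and divides by its sd
manual <- (x - mean(x)) / sd(x)

all.equal(as.vector(scale(x)), manual)  # TRUE
```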
# Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2673267 0.2425743 0.2475248 0.2425743
##
## Group means:
## zn indus chas nox rm age
## low 0.9691929 -0.9089157 -0.16296517 -0.8773034 0.43310935 -0.8994803
## med_low -0.1265035 -0.2695380 -0.11163110 -0.5611315 -0.16607253 -0.3191996
## med_high -0.3929105 0.1623854 0.20012296 0.4317047 0.02060242 0.4137680
## high -0.4872402 1.0171960 -0.07145661 1.1181640 -0.46565218 0.8504928
## dis rad tax ptratio black lstat
## low 0.8441420 -0.6979446 -0.7097905 -0.4452153 0.37788976 -0.77070038
## med_low 0.3209694 -0.5459225 -0.4470790 -0.1015357 0.31272351 -0.10549823
## med_high -0.4058625 -0.4156770 -0.3140979 -0.1873182 0.08204918 0.07367767
## high -0.8724425 1.6373367 1.5134896 0.7798552 -0.71097670 0.88792293
## medv
## low 0.50655731
## med_low -0.01444002
## med_high 0.06797723
## high -0.71375065
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.13793450 0.7107207358 -1.00129849
## indus -0.01680252 -0.1570751110 0.39172685
## chas -0.08386464 -0.1174589551 0.02734767
## nox 0.42230698 -0.6782361108 -1.53877206
## rm -0.09547367 -0.0708649040 -0.19359725
## age 0.30018977 -0.2628696590 -0.08087518
## dis -0.08781986 -0.1303867265 0.17827567
## rad 3.17275936 0.9793809225 0.01328385
## tax -0.02452146 0.0008232436 0.66291808
## ptratio 0.13012651 -0.0202560357 -0.49656307
## black -0.12449650 0.0282325115 0.14195264
## lstat 0.18079071 -0.2599928093 0.35072171
## medv 0.13895975 -0.3429475294 -0.14905851
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9444 0.0398 0.0159
#Draw the LDA (bi)plot.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
# Save the crime categories from the test set
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# Then predict the classes with the LDA model on the test data
lda.pred <- predict(lda.fit, newdata = test)
# Cross tabulate the results with the crime categories from the test set.
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 11 7 1 0
## med_low 9 14 5 0
## med_high 1 8 16 1
## high 0 0 0 29
Based on the cross-tabulation the LDA model worked best at predicting the high crime rates, getting all 29 of them right. Of the low crime rates the model predicted 11 out of 19 correctly, of the med_low rates 14 out of 28, and of the med_high rates 16 out of 26.
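The per-class counts can be condensed into a single overall accuracy: the proportion of predictions on the diagonal of the confusion matrix. A sketch using the counts from this run, hard-coded here because the random train/test split changes them between runs:

```r
# confusion matrix from this run (rows = correct, columns = predicted)
tab <- matrix(c(11,  7,  1,  0,
                 9, 14,  5,  0,
                 1,  8, 16,  1,
                 0,  0,  0, 29),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

# overall accuracy = correctly classified / total
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 3)  # 70 / 102, about 0.686
```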
# Reload the Boston dataset
data("Boston")
# Standardize the dataset to get comparable distances
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# Calculate the distances between the observations
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
#Run k-means algorithm on the dataset.
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
#Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line') # here it seems 2 is an optimal number of clusters
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
#p <- ggpairs(boston_scaled, col = km$cluster)
#p
The optimal number of clusters can be read from the plot of the total within-cluster sum of squares: it is the point where the line drops sharply and then levels off (the "elbow"). Here that happens at around 2, so 2 was chosen as the optimal number of clusters.
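Since `qplot()` is deprecated (as the warning above notes), the same elbow plot can be drawn with `ggplot()` directly; a sketch, assuming `k_max` and `twcss` from above:

```r
library(ggplot2)
# total within-cluster sum of squares against the number of clusters
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) +
  geom_line()
```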
# Standardize the dataset
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# Perform k-means on the original Boston data with some reasonable number of clusters (> 2)
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
# add the cluster value to scaled data
boston_scaled <- data.frame(boston_scaled, km$cluster)
# Perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model.
# linear discriminant analysis
lda.fit <- lda(km.cluster ~ ., data = boston_scaled[,-4]) # lda() raised an error because variable 4 (chas) is constant within groups, so it was removed; this may not be the ideal fix, but it resolves the error
# print the lda.fit object
lda.fit
## Call:
## lda(km.cluster ~ ., data = boston_scaled[, -4])
##
## Prior probabilities of groups:
## 1 2 3
## 0.06916996 0.61067194 0.32015810
##
## Group means:
## crim zn indus nox rm age dis
## 1 -0.2048299 -0.1564737 0.2306535 0.3342374 0.3344149 0.3170678 -0.3634565
## 2 -0.3882449 0.2731699 -0.6264383 -0.5823006 0.2188304 -0.4585819 0.4807157
## 3 0.7847946 -0.4872402 1.1450405 1.0384727 -0.4896488 0.8062002 -0.8383961
## rad tax ptratio black lstat medv
## 1 -0.02700292 -0.1304164 -0.4453253 0.1787986 -0.1976385 0.6422884
## 2 -0.58641200 -0.6161585 -0.2814183 0.3151747 -0.4640135 0.3182241
## 3 1.12436056 1.2034416 0.6329916 -0.6397959 0.9277624 -0.7457491
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim 0.02479166 -0.13204141
## zn 0.42787622 -0.04638198
## indus 1.15646011 0.64348753
## nox 0.47943272 0.42987408
## rm 0.13637610 -0.12823804
## age -0.06654278 0.30029385
## dis -0.01915297 0.20848367
## rad 0.74637979 0.69845574
## tax 0.27967651 -1.02040695
## ptratio 0.19355485 -0.21964359
## black -0.04753224 0.11581547
## lstat 0.47213016 0.02172623
## medv 0.06797263 0.92531426
##
## Proportion of trace:
## LD1 LD2
## 0.9822 0.0178
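As an aside, the constant column could also be dropped by name rather than by position, which is safer if the column order ever changes; a sketch (`boston_no_chas` and `lda.fit2` are names introduced here, not part of the original analysis):

```r
# drop the within-group-constant 'chas' column by name and refit the LDA
boston_no_chas <- dplyr::select(boston_scaled, -chas)
lda.fit2 <- lda(km.cluster ~ ., data = boston_no_chas)
```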
# Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution).
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# plot the lda results
plot(lda.fit, dimen = 2, col = boston_scaled$km.cluster, pch = km$cluster)
lda.arrows(lda.fit, myscale = 2)
# Interpret the results. Which variables are the most influential linear separators for the clusters? (0-2 points to compensate any loss of points from the above exercises)
Based on the biplot, the variables medv, rad, indus and tax are the most influential linear separators for the clusters, which shows in the plot as longer arrows.
#Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 2
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#Next, install and access the plotly package. Create a 3D plot (cool!) of the columns of the matrix product using the code below.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# note: matrix_product has an LD3 column only if lda.fit is the LDA fitted on the four crime classes (which yields three discriminants); with the three-cluster LDA above, lda.fit$scaling has two columns, so matrix_product$LD3 is NULL (hence the min/max warnings further below)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
#Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
#Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities?
#kmeans for train set
# Perform k-means on the original Boston data with some reasonable number of clusters (> 2)
# k-means clustering
set.seed(123)
km_train <- kmeans(train[,-14], centers = 3)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km_train$cluster)
## Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
## Inf
## Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
## -Inf
Yes, the plots differ: the points are colored differently in the two plots, but the spatial locations of the points stay the same in both.
# Load libraries:
library(tibble)
# read in data:
human <- read_csv("./data/human.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# country names to rownames
human <- column_to_rownames(human, "country")
#summaries of variables in data
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# graphical summary
library("GGally")
p <- ggpairs(human)
# calculate the correlation matrix and round it
cor_matrix <- cor(human) %>% round(digits = 2)
# print the correlation matrix
cor_matrix
## Edu2.FM Labo.FM Life.Exp Edu.Exp GNI Mat.Mor Ado.Birth Parli
## Edu2.FM 1.00 0.01 0.58 0.59 0.43 -0.66 -0.53 0.08
## Labo.FM 0.01 1.00 -0.14 0.05 -0.02 0.24 0.12 0.25
## Life.Exp 0.58 -0.14 1.00 0.79 0.63 -0.86 -0.73 0.17
## Edu.Exp 0.59 0.05 0.79 1.00 0.62 -0.74 -0.70 0.21
## GNI 0.43 -0.02 0.63 0.62 1.00 -0.50 -0.56 0.09
## Mat.Mor -0.66 0.24 -0.86 -0.74 -0.50 1.00 0.76 -0.09
## Ado.Birth -0.53 0.12 -0.73 -0.70 -0.56 0.76 1.00 -0.07
## Parli 0.08 0.25 0.17 0.21 0.09 -0.09 -0.07 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
In the plot matrix created with ggpairs() the variable names run along the top and left side of the plot. The upper right half of the matrix shows the correlations between the variables, the lower left half shows scatter plots of the continuous variables, and the diagonal shows their density plots.
From the matrix we can see that many variables correlate statistically significantly with each other; there are both negative and positive correlations. From the distribution plots on the diagonal, as well as from the summaries of the variables, we can see that none of the variables are normally distributed.
The correlation plot visualizes the correlations between the variables better: for example, life expectancy and expected years of schooling are strongly negatively correlated with maternal mortality ratio and adolescent birth rate.
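The strongest correlations can also be listed programmatically from the matrix; a small sketch, assuming `cor_matrix` from above (each pair appears twice because the matrix is symmetric):

```r
# melt the correlation matrix and keep the strong off-diagonal pairs
cor_long <- as.data.frame(as.table(cor_matrix))
subset(cor_long, as.character(Var1) != as.character(Var2) & abs(Freq) > 0.7)
```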
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
# draw a biplot of the principal component representation and the original variables, with PC1 and PC2
#commented out but graph below as an image
#p <- biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
# standardise the variables:
human_std <- scale(human)
summary(human_std)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI Mat.Mor Ado.Birth Parli
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
human_std <- as.data.frame(human_std)
# repeat PCA:
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
# draw a biplot of the principal component representation and the original variables, with PC1 and PC2
#commented out but graph below as an image
#p <- biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
The PCA biplots created with standardized and non-standardized data clearly differ. This is because PCA is sensitive to the scale of the variables, and standardization makes them comparable. In the non-standardized biplot only gross national income per capita, the variable with by far the largest variance, seems to influence the first principal component, and nothing influences the second; the other variables appear to have no influence on the principal components. There are no clearly separate clusters, but two groups can be seen, one forming a vertical group and one forming a diagonal group.
In the PCA biplot with standardized data it is visible that more variables influence the principal components (PCs). Most of the variables seem to influence the first PC; only percent representation in parliament (Parli) and the ratio of labour force participation of females and males (Labo.FM) don't seem to influence PC1 much. Parli and Labo.FM instead influence the second PC. Maternal mortality ratio (Mat.Mor) and adolescent birth rate seem to correlate positively, as do life expectancy at birth (Life.Exp), the ratio of female and male populations with secondary education, gross national income per capita and expected years of schooling. Variables in these two groups also correlate negatively with the variables in the other group (e.g. Mat.Mor correlates negatively with Life.Exp). Parli and Labo.FM are positively correlated with each other but do not correlate with the other variables.
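This interpretation can be backed up numerically by checking how much variance each component captures; a sketch, assuming `pca_human` is the PCA on the standardized data:

```r
# percentage of total variance explained by each principal component
s <- summary(pca_human)
round(100 * s$importance["Proportion of Variance", ], 1)
```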
About the data: “The tea data comes from the FactoMineR package and it is measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions). In addition, some personal details were asked (4 questions).”
# Load libraries:
library(dplyr)
library(tidyr)
library(ggplot2)
library(FactoMineR)
#library(factoextra)
# Load tea data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# Look at structure and dimensions of the data
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea) # 300 obs. 36 var.
## [1] 300 36
View(tea)
# From now on I will focus on the "Tea", "How", "how", "sugar", "where" and "lunch" columns:
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, all_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
# these are commented out to avoid duplicate graphs as I attached them as images
#p1 <- plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")
#p2 <- plot(mca, graph.type = "classic", habillage = "quali")
In the MCA plot it is visible that the categories other, unpackaged and tea shop are situated close to each other, which means they are similar in relation to all the variables; it could be that people who buy tea from a tea shop also buy it unpackaged. These categories are far from the origin and affect both dimensions. The categories other, chain store+tea shop, and tea bag+unpackaged are also situated close to each other, meaning they are similar in relation to the other variables, and they influence dimension 2. Green is situated apart from all the other categories, meaning it may be somehow different from the rest, and it influences mostly dimension 1. The map with both variables and individuals shows how the individuals are situated in the space in relation to each other, the variables, and the dimensions.
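If the factoextra package (left commented out in the library calls above) is available, the variable map can be drawn with labels repelled to reduce overlap; a sketch, not part of the original analysis:

```r
library(factoextra)
# MCA variable category map with non-overlapping labels
fviz_mca_var(mca, repel = TRUE)
```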
# Read the RATS data
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
# Look at the (column) names of RATS
names(RATS)
## [1] "ID" "Group" "WD1" "WD8" "WD15" "WD22" "WD29" "WD36" "WD43"
## [10] "WD44" "WD50" "WD57" "WD64"
# Look at the structure of RATS
str(RATS)
## 'data.frame': 16 obs. of 13 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD1 : int 240 225 245 260 255 260 275 245 410 405 ...
## $ WD8 : int 250 230 250 255 260 265 275 255 415 420 ...
## $ WD15 : int 255 230 250 255 255 270 260 260 425 430 ...
## $ WD22 : int 260 232 255 265 270 275 270 268 428 440 ...
## $ WD29 : int 262 240 262 265 270 275 273 270 438 448 ...
## $ WD36 : int 258 240 265 268 273 277 274 265 443 460 ...
## $ WD43 : int 266 243 267 270 274 278 276 265 442 458 ...
## $ WD44 : int 266 244 267 272 273 278 271 267 446 464 ...
## $ WD50 : int 265 238 264 274 276 284 282 273 456 475 ...
## $ WD57 : int 272 247 268 273 278 279 281 274 468 484 ...
## $ WD64 : int 278 245 269 275 280 281 284 278 478 496 ...
# Print out summaries of the variables
summary(RATS)
## ID Group WD1 WD8 WD15
## Min. : 1.00 Min. :1.00 Min. :225.0 Min. :230.0 Min. :230.0
## 1st Qu.: 4.75 1st Qu.:1.00 1st Qu.:252.5 1st Qu.:255.0 1st Qu.:255.0
## Median : 8.50 Median :1.50 Median :340.0 Median :345.0 Median :347.5
## Mean : 8.50 Mean :1.75 Mean :365.9 Mean :369.1 Mean :372.5
## 3rd Qu.:12.25 3rd Qu.:2.25 3rd Qu.:480.0 3rd Qu.:476.2 3rd Qu.:486.2
## Max. :16.00 Max. :3.00 Max. :555.0 Max. :560.0 Max. :565.0
## WD22 WD29 WD36 WD43
## Min. :232.0 Min. :240.0 Min. :240.0 Min. :243.0
## 1st Qu.:267.2 1st Qu.:268.8 1st Qu.:267.2 1st Qu.:269.2
## Median :351.5 Median :356.5 Median :360.0 Median :360.0
## Mean :379.2 Mean :383.9 Mean :387.0 Mean :386.0
## 3rd Qu.:492.5 3rd Qu.:497.8 3rd Qu.:504.2 3rd Qu.:501.0
## Max. :580.0 Max. :590.0 Max. :597.0 Max. :595.0
## WD44 WD50 WD57 WD64
## Min. :244.0 Min. :238.0 Min. :247.0 Min. :245.0
## 1st Qu.:270.0 1st Qu.:273.8 1st Qu.:273.8 1st Qu.:278.0
## Median :362.0 Median :370.0 Median :373.5 Median :378.0
## Mean :388.3 Mean :394.6 Mean :398.6 Mean :404.1
## 3rd Qu.:510.5 3rd Qu.:516.0 3rd Qu.:524.5 3rd Qu.:530.8
## Max. :595.0 Max. :612.0 Max. :618.0 Max. :628.0
# Access the packages dplyr and tidyr
library(dplyr)
library(tidyr)
# Factor variables ID and Group
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
# Convert to long form
RATSL <- pivot_longer(RATS, cols = -c(ID, Group),
names_to = "WD",
values_to = "Weight") %>%
mutate(Time = as.integer(substr(WD,3,4))) %>%
arrange(Time)
# Take a glimpse at the RATSL data
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
#Access the package ggplot2
library(ggplot2)
# Draw the plot
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none")
From this plot we can see that groups 2 and 3 differ clearly from group 1 already in their starting weights. Group 1 also has more individuals in it. In group 2 there is an individual that seems to be an outlier, with a much higher weight throughout the study.
library(dplyr)
library(tidyr)
# Standardise the variable RATSL
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdrats = (Weight - mean(Weight))/sd(Weight)) %>%
ungroup()
# Glimpse the data
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1,…
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1",…
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 47…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8,…
## $ stdrats <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, -0…
# Plot again with the standardised weights
library(ggplot2)
ggplot(RATSL, aes(x = Time, y = stdrats, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized weight")
In the plot with standardized values the lines are straighter, as each weight has been standardized relative to all the other weights at the same time point. Whereas in the previous, non-standardized plot the weights had an increasing trend, in this plot some of the standardized weights even show a decreasing trend.
library(dplyr)
library(tidyr)
# Summary data with mean and standard error of Weight by Group and Time
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n()) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
# Plot the mean profiles
library(ggplot2)
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
In the mean profiles plot it is again visible that group 1 differs from groups 2 and 3, having clearly lower mean weights.
library(dplyr)
library(tidyr)
# Create summary data by Group and ID with the mean weight as the summary variable
RATSL8S <- RATSL %>%
#filter(Time > 1) %>% # uncomment to drop the baseline measurement (day 1)
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(RATSL8S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.…
# Draw a boxplot of the mean versus treatment
library(ggplot2)
ggplot(RATSL8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), weeks 1-9")
# Create a new data by filtering the outlier and adjust the ggplot code the draw the plot again with the new data
RATSL8S1 <- filter(RATSL8S, mean <= 550)
# Glimpse the data
glimpse(RATSL8S1)
## Rows: 15
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.…
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), weeks 1-9")
In the first boxplot of the mean versus the group, the outlier in group 2 is clearly visible. Groups 1 and 3 also appear to have outliers, although less extreme ones than group 2. After removing the outlier of group 2, the box representing the spread of that group became much smaller.
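Instead of relying on the hard-coded 550 cutoff, the outlying individual can also be identified directly; a sketch, assuming `RATSL8S` from above:

```r
# which individual has the suspiciously high mean weight?
dplyr::filter(RATSL8S, mean > 550)
```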
library(dplyr)
library(tidyr)
# Add the baseline from the original data as a new variable to the summary data
RATSL8S2 <- RATSL8S %>%
mutate(baseline = RATS$WD1)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = RATSL8S2)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 252125 252125 2237.0655 5.217e-15 ***
## Group 2 726 363 3.2219 0.07586 .
## Residuals 12 1352 113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA results the baseline weights are strongly related to the mean weights in the following weeks, while the group differences are not statistically significant (p ≈ 0.076) once the baseline is accounted for.
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", sep =" ", header = T)
# Look at the (column) names of BPRS
names(BPRS)
## [1] "treatment" "subject" "week0" "week1" "week2" "week3"
## [7] "week4" "week5" "week6" "week7" "week8"
# Look at the structure of BPRS
str(BPRS)
## 'data.frame': 40 obs. of 11 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ week0 : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week1 : int 36 68 55 77 75 43 61 36 43 51 ...
## $ week2 : int 36 61 41 49 72 41 47 38 39 51 ...
## $ week3 : int 43 55 38 54 65 38 30 38 35 55 ...
## $ week4 : int 41 43 43 56 50 36 27 31 28 53 ...
## $ week5 : int 40 34 28 50 39 29 40 26 22 43 ...
## $ week6 : int 38 28 29 47 32 33 30 26 20 43 ...
## $ week7 : int 47 28 25 42 38 27 31 25 23 39 ...
## $ week8 : int 51 28 24 46 32 25 31 24 21 32 ...
# Print out summaries of the variables
summary(BPRS)
## treatment subject week0 week1 week2
## Min. :1.0 Min. : 1.00 Min. :24.00 Min. :23.00 Min. :26.0
## 1st Qu.:1.0 1st Qu.: 5.75 1st Qu.:38.00 1st Qu.:35.00 1st Qu.:32.0
## Median :1.5 Median :10.50 Median :46.00 Median :41.00 Median :38.0
## Mean :1.5 Mean :10.50 Mean :48.00 Mean :46.33 Mean :41.7
## 3rd Qu.:2.0 3rd Qu.:15.25 3rd Qu.:58.25 3rd Qu.:54.25 3rd Qu.:49.0
## Max. :2.0 Max. :20.00 Max. :78.00 Max. :95.00 Max. :75.0
## week3 week4 week5 week6 week7
## Min. :24.00 Min. :20.00 Min. :20.00 Min. :19.00 Min. :18.0
## 1st Qu.:29.75 1st Qu.:28.00 1st Qu.:26.00 1st Qu.:22.75 1st Qu.:23.0
## Median :36.50 Median :34.50 Median :30.50 Median :28.50 Median :30.0
## Mean :39.15 Mean :36.35 Mean :32.55 Mean :31.23 Mean :32.2
## 3rd Qu.:44.50 3rd Qu.:43.00 3rd Qu.:38.00 3rd Qu.:37.00 3rd Qu.:38.0
## Max. :76.00 Max. :66.00 Max. :64.00 Max. :64.00 Max. :62.0
## week8
## Min. :20.00
## 1st Qu.:22.75
## Median :28.00
## Mean :31.43
## 3rd Qu.:35.25
## Max. :75.00
library(dplyr)
library(tidyr)
# Factor treatment & subject
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
# Convert to long form
BPRSL <- pivot_longer(BPRS, cols = -c(treatment, subject),
names_to = "weeks", values_to = "bprs") %>%
arrange(weeks) #order by weeks variable
# Extract the week number
BPRSL <- BPRSL %>%
mutate(week = as.integer(substr(weeks,5,5)))
# Take a glimpse at the BPRSL data
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
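As an aside, `substr(weeks, 5, 5)` works here only because every week number is a single digit. A more general alternative (a sketch, not needed for this data) strips the non-digit prefix instead:

```r
# Extract the numeric part of measurement labels; this would also handle
# two-digit numbers such as the hypothetical label "week10"
labels <- c("week0", "week8", "week10")
as.integer(gsub("[^0-9]+", "", labels))  # 0 8 10
```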
# Plot the data
library(ggplot2)
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line()
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_line() +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "bprs") +
  theme(legend.position = "top")
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  # subject has 20 levels, so recycle the 10 basic line types twice
  scale_linetype_manual(values = rep(1:10, times = 2)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Based on the coefficient estimates of the fitted model, bprs decreases on average by about 2.27 points per week; this coefficient is statistically significant with a p-value of < 2e-16. The treatment effect is small and not statistically significant (p = 0.661). So according to this model, the later the evaluation was done, the lower the BPRS score tended to be. At this point it is good to keep in mind that this model assumes the repeated measures on the same subject are independent, which is not very likely here.
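To illustrate the point about independence, here is a minimal sketch with simulated data (not the BPRS data): when observations share a subject-level component, the residuals of a plain `lm()` remain correlated within subject.

```r
# Simulate 20 subjects measured over 9 weeks with subject-specific shifts
set.seed(1)
n_subj <- 20; n_week <- 9
subject <- rep(seq_len(n_subj), each = n_week)
week    <- rep(0:(n_week - 1), times = n_subj)
u       <- rnorm(n_subj, sd = 7)   # subject-specific intercept shifts
y       <- 46 - 2.3 * week + u[subject] + rnorm(n_subj * n_week, sd = 10)
# Fit a plain regression that ignores the subject structure
fit_sim <- lm(y ~ week)
r       <- resid(fit_sim)
# Correlation of residuals between weeks 0 and 1 across subjects:
# it tends to be positive because of the shared subject effect
cor(r[week == 0], r[week == 1])
```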
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
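A side-calculation from the variance components above: the intraclass correlation estimates the share of total variance due to differences between subjects, which quantifies how strongly measurements on the same subject are correlated.

```r
# Values copied from the random-effects part of the summary above
var_subject  <- 47.41    # subject (Intercept) variance
var_residual <- 104.21   # residual variance
icc <- var_subject / (var_subject + var_residual)
round(icc, 2)  # 0.31: about a third of the variance is between subjects
```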
# create a random intercept and random slope model
library(lme4)
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova comparison shows that the model with a random intercept and slope (BPRS_ref1) fits significantly better than the model with just a random intercept (BPRS_ref), with a p-value of 0.02636.
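The reported chi-squared statistic and p-value can be reproduced by hand from the anova() output: the statistic equals the drop in deviance, and the degrees of freedom equal the difference in parameter counts.

```r
# Values copied from the anova() output above
chisq <- 7.2721   # equals the deviance drop, 2738.7 - 2731.4, up to rounding
p_lrt <- pchisq(chisq, df = 7 - 5, lower.tail = FALSE)
p_lrt  # approximately 0.026, matching the reported Pr(>Chisq)
```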
# create another random intercept and random slope model
# (note: the random effects below are grouped by treatment rather than
# subject, and no week * treatment interaction is actually included)
library(lme4)
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | treatment), data = BPRSL, REML = FALSE)
## boundary (singular) fit: see help('isSingular')
# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | treatment)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2843.3 2870.5 -1414.7 2829.3 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8252 -0.7277 -0.2586 0.5697 4.0818
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## treatment (Intercept) 2.824e-02 0.16804
## week 1.765e-03 0.04201 -1.00
## Residual 1.516e+02 12.31302
## Number of obs: 360, groups: treatment, 2
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.3664 33.996
## week -2.2704 0.2531 -8.971
## treatment2 0.5722 1.2979 0.441
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.741
## treatment2 -0.475 0.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref2: bprs ~ week + treatment + (week | treatment)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref2 7 2843.3 2870.5 -1414.7 2829.3
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 97.896 0
# draw the plot of BPRSL with the observed bprs values
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_line() +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "BPRS") +
  theme(legend.position = "top")
# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref2)
library(dplyr)
library(tidyr)
# Add the fitted values as a new column to BPRSL
BPRSL <- BPRSL %>% mutate(Fitted = Fitted)
# draw the plot of BPRSL with the Fitted values of bprs
library(ggplot2)
ggplot(BPRSL, aes(x = week, y = Fitted, group = subject)) +
geom_line() +
scale_x_continuous(name = "Time (weeks)") +
scale_y_continuous(name = "BPRS") +
theme(legend.position = "top")
I can't get this last plot to work correctly, which is consistent with the model being misspecified: BPRS_ref2 groups its random effects by treatment (only two groups), which is what produced the singular-fit warning, and it omits the week * treatment interaction the exercise presumably intended (something like bprs ~ week * treatment + (week | subject)). The anova test of BPRS_ref2 against BPRS_ref1 gives no p-value because both models have seven parameters (Df = 0), but BPRS_ref1 has a clearly lower AIC (2745.4 vs. 2843.3) and deviance, so BPRS_ref2 did not improve on it and can be discarded.